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Abstract. Steepest descent preconditioning is considered for the recently proposed nonlinear 
generalized minimal residual (N-GMRES) optimization algorithm for unconstrained nonlinear op- 
timization. Two steepest descent preconditioning variants are proposed. The first employs a line 
search, while the second employs a predefined small step. A simple global convergence proof is 
provided for the N-GMRES optimization algorithm with the first steepest descent preconditioncr 
(with line search), under mild standard conditions on the objective function and the line search pro- 
cesses. Steepest descent preconditioning for N-GMRES optimization is also motivated by relating 
it to standard non-preconditioned GMRES for linear systems in the case of a standard quadratic 
optimization problem with symmetric positive definite operator. Numerical tests on a variety of 
model problems show that the N-GMRES optimization algorithm is able to very significantly accel- 
erate convergence of stand-alone steepest descent optimization. Moreover, performance of steepest- 
descent preconditioned N-GMRES is shown to be competitive with standard nonlinear conjugate 
gradient and limited-memory Broyden-Fletcher-Goldfarb-Shanno methods for the model problems 
considered. These results serve to theoretically and numerically establish steepest-descent precondi- 
tioned N-GMRES as a general optimization method for unconstrained nonlinear optimization, with 
performance that appears promising compared to established techniques. In addition, it is argued 
that the real potential of the N-GMRES optimization framework lies in the fact that it can make 
use of problem-dependent nonlinear preconditioners that are more powerful than steepest descent 
(or, equivalently, N-GMRES can be used as a simple wrapper around any other iterative optimiza- 
tion process to seek acceleration of that process), and this potential is illustrated with a further 
application example. 

Key words, nonlinear optimization, GMRES, steepest descent 

AMS subject classifications. 65K10 Optimization, 65F08 Preconditioners for iterative meth- 
ods, 65FI0 Iterative methods 

1. Introduction. In recent work on canonical tensor approximation [3J, we have 
proposed an algorithm that accelerates convergence of the alternating least squares 
(ALS) optimization method for the canonical tensor approximation problem consid- 
ered there. The algorithm proceeds by linearly recombining previous iterates in a 
way that approximately minimizes the residual (the gradient of the objective func- 
tion), using a nonlinear generalized minimal residual (GMRES) approach. The re- 
combination step is followed by a line search step for globalization, and the resulting 
three-step non-linear GMRES (N-GMRES) optimization algorithm is shown in [3] to 
significantly speed up the convergence of ALS for the canonical tensor approximation 
problem considered. 

As explained in [3] (which we refer to as Paper I in what follows), for the tensor 
approximation problem considered there, ALS can also be interpreted as a precondi- 
tioner for the N-GMRES optimization algorithm. The question then arises what other 
types of preconditioners can be considered for the N-GMRES optimization algorithm 
proposed in Paper I, and whether there are universal preconditioning approaches that 
can make the N-GMRES optimization algorithm applicable to nonlinear optimization 
problems more generally. In the present paper, we propose such a universal precon- 
ditioning approach for the N-GMRES optimization algorithm proposed in Paper I, 
namely, steepest descent preconditioning. We explain how updates in the steepest 
descent direction can indeed naturally be used as a preconditioning process for the 



'Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, Canada 
§ hdcstcrck@uwatcrloo.ca 



1 



2 



H. De Sterck 



N-GMRES optimization algorithm. In fact, we show that steepest descent precon- 
ditioning can be seen as the most basic preconditioning process for the N-GMRES 
optmization method, in the sense that applying N-GMRES to a quadratic objective 
function with symmetric positive definite (SPD) operator, corresponds mathemati- 
cally to applying standard non-preconditioned GMRES for linear systems to the linear 
system corresponding to the quadratic objective function. We propose two variants of 
steepest descent preconditioning, one with line search and one with a predefined small 
step. We give a simple global convergence proof for the N-GMRES optimization al- 
gorithm with our first proposed variant of steepest descent preconditioning (with line 
search), under standard mild conditions on the objective function and for line searches 
satisfying the Wolfe conditions. The second preconditioning approach, without line 
search, is of interest because it is more efficient in numerical tests, but there is no 
convergence guarantee. Numerical results are employed for a variety of test problems 
demonstrating that N-GMRES optimization can significantly speed up stand-alone 
steepest descent optimization. We also compare steepest-descent preconditioned N- 
GMRES with a standard nonlinear conjugate gradient (N-CG) method for all our test 
problems, and with a standard limited-memory Broyden-Fletcher-Goldfarb-Shanno 
(L-BFGS) method. 

We consider the following unconstrained nonlinear optimization problem with as- 
sociated first-order optimality equations: 

OPTIMIZATION PROBLEM I: 

find u* that minimizes /(u). (I- 1 ) 
FIRST-ORDER OPTIMALITY EQUATIONS I: 

V/(u) = g(u) = 0. (1.2) 

The N-GMRES optimization algorithm proposed in Paper I for accelerating ALS 
for canonical tensor approximation consists of three steps that can be summarized as 
follows. (Fig. 1 1.1 1 gives a schematic representation of the algorithm, and it is described 
in pseudo-code in Algorithm [1] ) In the first step, a preliminary new iterate u^+i is 
generated from the last iterate using a one-step iterative update process M{.), 
which can be interpreted as a preconditioning process (see Paper I and below). ALS 
preconditioning is used for M(.) in Paper I. In the second step, an accelerated iterate 
Uj+i is obtained by linearly recombining previous iterates in a window of size w, 
(uj_ w _|_X) • ■ • i u i)i using a nonlinear GMRES approach. (The details of this step will 
be recalled in Section [5] below.) In the third step, a line search is performed that 
minimizes objective function f (u) on a half line starting at preliminary iterate Uj+i, 
which was generated in Step I, and connecting it with accelerated iterate £ij+i, which 
was generated in Step II, to obtain the new iterate u,+i. 

The second step in the N-GMRES optimization algorithm (Step II in Algorithm 
[T]) uses the nonlinear extension of GMRES for solving nonlinear systems of equations 
that was proposed by Washio and Oosterlee in [T5] in the context of nonlinear partial 
differential equation (PDE) systems (see also [12] and [18] for further applications 
to PDE systems). It is a nonlinear extension of the celebrated GMRES method for 
iteratively solving systems of linear equations [15] [M]. Washio and Oosterlee's non- 
linear extension is related to Flexible GMRES as described in [13] , and is also related 
to the reduced rank extrapolation method |16j . An early description of this type of 
nonlinear iterate acceleration ideas for solving nonlinear equation systems appears in 
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so-called Anderson mixing, see, e.g., [3J[T7]. More recent applications of these ideas to 
nonlinear equation systems and fixed-point problems arc discussed in [5J I17| . In Pa- 
per I we formulated a nonlinear GMRES optimization algorithm for canonical tensor 
decomposition that uses this type of acceleration as one of its steps, combined with an 
ALS preconditioning step and a line search for globalization. The type of nonlinear 
iterate acceleration in Step II of Algorithm [T] has thus been considered several times 
before in the context of solving nonlinear systems of equations, but we believe that 
its combination with a line search to obtain a general preconditioned nonlinear opti- 
mization method as in Algorithm [1] (see Paper I) is new in the optimization context. 
In the present paper we show how this N-GMRES optimization approach can be ap- 
plied to a broad class of sufficiently smooth nonlinear optimization problems by using 
steepest descent preconditioning. We establish theoretical convergence properties for 
this approach and demonstrate its effectiveness in numerical tests. 



Algorithm 1: N-GMRES optimization algorithm (window size w) 
Input: w initial iterates Uo, . . . , u^-i- 

i = w — 1 
repeat 

Step I: (generate preliminary iterate by one-step update process M(.)) 
u l+1 = M(iii) 

Step II: (generate accelerated iterate by nonlinear GMRES step) 

u i+ i =gmrcs(u i _ lu+ i, . . . , u 4 ; u i+ i) 
Step III: (generate new iterate by line search process) 

if — u 1+ i is a descent direction 

Ui+i =linesearch(u i+ i + f3(u l+1 - u l+1 )) 

else 

u l+ i = u i+ i 

end 

i = i + 1 

until convergence criterion satisfied 



(Note that the w initial iterates required in Algorithm [T] can naturally be generated 
by applying the algorithm with a window size that gradually increases from one up to 
w, starting from a single initial guess. Also, as in [3], we perform a restart and reset 
the window size back to 1 whenever Ui+i — Ui+i is not a descent direction.) 

The rest of this paper is structured as follows. In Section[2]we propose two types of 
steepest descent preconditioned for N-GMRES Optimization Algorithm[T] We briefly 
recall the details of the nonlinear GMRES optimization step, give a motivation and in- 
terpretation for steepest descent preconditioning that relate it to non-preconditioned 
GMRES for SPD linear systems, and give a simple proof for global convergence of 
the N-GMRES optimization algorithm using steepest descent preconditioning with 
line search. In Section [3] we present extensive numerical results for N-GMRES opti- 
mization with the two proposed steepest descent prcconditioncrs, applied to a variety 
of nonlinear optimization problems, and compare with stand-alone steepest descent, 
N-CG and L-BFGS. Finally, Section H concludes. 

2. Steepest Descent Preconditioning for N-GMRES Optimization. In 

this section, we first propose two variants of steepest descent preconditioning. We 
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Fig. 1.1. Schematic representation of one iteration of the N-GMRES optimization algorithm 
(from Wf ). Given previous iterations uo, ui and U2, new iterate 113 is generated as follows. In Step 
I, preliminary iterate U3 is generated by the one-step update process Af(.): U3 = M(u2). In Step II, 
the nonlinear GMRES step, accelerated iterate U3 is obtained by determining the coefficients ctj in 
U3 = U3 + aodo + c«idi +«2d2 such that the gradient of the objective function in U3 is approximately 
minimized. In Step III, the new iterate, 113, is finally generated by a line search that minimizes the 
objective function /(Q3 + /3(u3 — Q3)). 

then briefly recall the details of the nonlinear GMRES recombination step (Step II 
in Algorithm [T]), and relate N-GMRES optimization to standard non-preconditioned 
GMRES for linear systems in the case of a simple quadratic optimization problem with 
SPD operator. Finally, we give a simple global convergence proof for the N-GMRES 
optimization algorithm using steepest descent preconditioning with line search. 

2.1. Steepest Descent Preconditioning Process. We propose a general steep- 
est descent preconditioning process for Step I of N-GMRES Optimization Algorithm 
[T]with the following two variants: 

Steepest Descent Preconditioning Process: 

with 

Psdls, (2.1) 

/3 sd = min(<J, ||V/(ui)||). (2.2) 

For Option A, f3 s( n s is the step length obtained by a line search procedure. For 
dcfinitencss, we consider a line search procedure that satisfies the Wolfe conditions (see 
below). We refer to the steepest descent preconditioning process with line search (|2.ip 
as the sdls preconditioner. For Option B, we predefine the step /3 s( i as the minimum 
of a small positive constant 6, and the norm of the gradient. In the numerical results 
to be presented further on in the paper, we use S = 10~ 4 , except where noted. We 
refer to the steepest descent preconditioning process with predefined step j3 sd - (|2.2[) as 
the sd preconditioner. These two Options are quite different, and some discussion is 
in order. 

Preconditioning process A can be employed as a stand-alone optimization method 
(it can converge by itself), and N-GMRES can be considered as a wrapper that accel- 
erates this stand-alone process. We will show below that N-GMRES with precondi- 
tioning process A has strong convergence properties, but it may be expensive because 
the line search may require a significant number of function and gradient (/ / g) eval- 
uations. However, the situation is very different for preconditioning process B. Here, 
no additional f / g evaluations are required, but convergence appears questionable. It 



Ui+i =Ui-/3- 



|v/K)|| 

option A: p 
OPTION B: j3 
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is clear that preconditioning process B cannot be used as a stand-alone optimization 
algorithm; in most cases it would not converge. It can, however, still be used as a pre- 
conditioning process for N-GMRES. As is well-known and will be further illustrated 
below, preconditioncrs used by GMRES for linear systems do not need to be conver- 
gent by themselves, and this suggests that it may be interesting to consider this for 
N-GMRES optimization as well. As will be motivated further below, the role of the 
N-GMRES preconditioning process is to provide new 'useful' directions for the nonlin- 
ear generalization of the Krylov space, and the iteration can be driven to convergence 
by the N-GMRES minimization, even if the preconditioner is not convergent by itself. 
However, for this to happen in the three-step N-GMRES optimization algorithm with 
preconditioning process B, it is required that u i+1 eventually approaches Uj and the 
step length (3 s d approaches 0. For this reason, we select /3 S£ ; = ||V/(uj)|| as soon as 
l|V/(uj)|| < 6. The initial step length /3 s( j is chosen to be not larger than a small 
constant because the linear case (see below) suggests that a small step is sufficient 
to provide a new direction for the Krylov space, and because the minimization of the 
residual is based on a linearization argument (see also below), and small steps tend 
to lead to small linearization errors. 

2.2. N-GMRES Recombination Step. Before relating steepest-descent pre- 
conditioned N-GMRES to non-preconditioned GMRES for linear systems, we first re- 
call from [3] some details of the N-GMRES recombination step, Step II in Algorithm 
[TJ In this step, we find an accelerated iterate Uj+i that is obtained by recombining 
previous iterates as follows: 

i 

Uj+1 = Uj+1 + ^ U 3 ( U «+l _ U J')- ( 2 ' 3 ) 

3=0 

The unknown coefficients aj are determined by the N-GMRES algorithm in such a 
way that the two-norm of the gradient of the objective function evaluated at the 
accelerated iterate is small. In general, g(.) is a nonlinear function of the atj, and 
linearization is used to allow for inexpensive computation of coefficients oij that may 
approximately minimize ||g(uj + i)||2. Using the following approximations 



g(ui+i) ~ g(u i+ i) + J2 C 



du 



"j (ui+i - u i) 



« g(u,+i) + a J (g( fl i+0 - SK')) (2-4) 

3=0 

one arrives at minimization problem 

find coefficients (ao, . . . , ctj) that minimize 

i 

|jg(u i+1 ) + ]T aj (g(u i+1 ) - g(u,))|| 2 . (2.5) 

3=0 

This is a standard least-squares problem that can be solved, for example, by using the 
normal equations, as explained in [181 [3]. (In this paper, we solve the least-squares 
problem as described in [3].) 

In a windowed implementation with window size w, the memory cost incurred 
by N-GMRES acceleration is the storage of w previous approximations and residuals. 
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The dominant parts of the CPU cost for each acceleration step are the cost of building 
and solving the least-squares system (which can be done in approximately 2nw flops 
if the normal equations are used and some previous inner products are stored, see 
|18|). and nw flops to compute the accelerated iterate. For problems with expensive 
objective functions, this cost is often negligible compared to the cost of the f/g 
evaluations in the line searches [3]. 

2.3. Motivation and Interpretation for Steepest Descent Precondition- 
ing. Consider a standard quadratic minimization problem with objective function 

/(u) = iu T Au-b T u, (2.6) 

where A is SPD. It is well-known that its unique minimizer satisfies ^4u = b. Now 
consider applying the N-GMRES optimization algorithm with steepest descent precon- 
ditioner to the quadratic minimization problem. The gradient of / at approximation 
Ui is given by 

V/(ui) = Aui -b = -ri with r; = b - Au,, (2.7) 

where Ti is defined as the residual of the linear system Au = b in u^. N-GMRES 
steepest descent preconditioner (|2.1j) - (|2.2[) then reduces to the form 

u i+ i = Ui + /3-jr^|7, (2.8) 

and it can easily be shown that this corresponds to the stationary iterative method 
that generates the Krylov space in non-preconditioned linear GMRES applied to Au = 
b. We now briefly show this because it provides further insight (recalling parts of the 
discussion in [TBI 13]). 

We first explain how preconditioned GMRES for ^4u = b works. Consider so- 
called stationary iterative methods for Au = b of the following form: 

\i i+1 = u i + M- 1 r i . (2.9) 

Here, matrix M is an approximation of A that has an easily computable inverse, i.e., 
M _1 rj A -1 . For example, M can be chosen to correspond to Gauss-Seidel or Jacobi 
iteration, or to a multigrid cycle [18] . 

Consider a sequence of iterates Uo , . . . , Uj generated by update formula (|2.9[) , 
starting from some initial guess Uo. Note that the residuals of these iterates are 
related as r, = b - A u t = (I - AM _1 )r,_i = (I - AM" 1 ) 1 ^. This motivates the 
definition of the following vector spaces: 

Vij+i = span{r , . . . ,r,}, 

V iti+1 = span{T , AM- 1 r , (AM' 1 ) 2 r }, . . . , (AM' 1 )' 1 r } 

= K l+1 {AM~\v ), 
Vs.i+i = span{M (u l+1 - u ), M (u i+1 - m), . . . , M (u i+ i - u.j)}. 

Vector space V2,i+i is the so-called Krylov space Ki + i(AM~ 1 1 ro) of order i + 1, 
generated by matrix AM -1 and vector ro. It is easy to show that these vector spaces 
are equal (see, e.g., [TgllS]). 

Expression ([2l?j) shows that M (u t+1 - u. t ) G K l+ i (AM -1 , r ). The GMRES 
procedure can be seen as a way to accelerate stationary iterative method (|2.9[) . by 
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recombining iterates (or, equivalcntly, by reusing residuals). In particular, we seek a 
better approximation Uj+i, with M (iii+i — Uj) in the Krylov space K i+ i(AM~ 1 , r ), 
such that = b — Aui + i has minimal two-norm. In other words, we seek optimal 
cocfhcicnts fij in 



and it is easy to show that this corresponds to seeking optimal coefficients o.j in 



such that ||fj_|_i||2 is minimized (which leads to a small least-squares problem equiv- 
alent to (|2.5p ). Note that V\a+\ and V^i+i do not easily generalize to the nonlinear 
case, but the image of V\j,+\ under M , span{ui + i — uo, Uj+i — Ui, . . . , Uj+i — u,}, 
does generalize naturally and is taken as the 'generalized Krylov space' that is used 
to seek the approximation in the nonlinear case. 

Up to this point, we have presented GMRES as a way to accelerate one-step 
stationary iterative method (|2.9|) . A more customary way, however, to see GM- 
RES is in terms of preconditioning. The approach described above reduces to 'non- 
preconditioned' GMRES when one sets M = I. Applying non-preconditioned GM- 
RES to the preconditioned linear equation system AM~ 1 (Mu) = b also results in the 
expressions for preconditioned GMRES derived above. In this viewpoint, the matrix 
M _1 is called the preconditioner matrix, because its role is viewed as to pre-condition 
the spectrum of the linear system operator such that the (non-preconditioned) GM- 
RES method applied to (AM~ 1 )y = b becomes more effective. It is also customary to 
say that the stationary iterative process preconditions GMRES (for example, Gauss- 
Seidel or Jacobi can precondition GMRES). We can summarize that the role of the 
stationary iterative method is to generate preconditioned residuals that build the 
Krylov space. 

In the presentation above, all iterates Uj for j = 0,...,i (for instance, in the 
right-hand side of p.lOp ) refer to the unaccelerated iterates generated by stationary 
iterative method (|2.9I) . However, the formulas remain valid when accelerated iterates 
are used instead; this does change the values of the coefficients oij, but leads to the 
same accelerated iterates [T8]. This is so because the Krylov spaces generated in the 
two cases are identical due to linearity, and consequently GMRES selects the same 
optimal improved iterate. 

This brings us to the point where we can compare steepest-descent preconditioned 
N-GMRES applied to quadratic objective function (|2.6p with SPD operator A, to 
non-preconditioned linear GMRES applied to An = b. Assume we have w previous 
iterates U, and residuals . Stationary iterative process (|2.9p without preconditioner 
(M = I) would add a vector to the Krylov space which has the same direction as 
the vector that would be added to it by the steepest descent preconditioning process 
(|2.8p . This means that the accelerated iterate u,+i produced by N-GMRES with 
steepest descent preconditioner applied to quadratic objective function f|2 . 6|) with SPD 
operator A is the same as the accelerated iterate u,+i produced by linear GMRES 
with identity preconditioner applied to ^4u = b. This motivates our proposal to use 




3=0 




(2.10) 



3=0 
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steepest descent preconditioning as the natural and most basic preconditioning process 
for the N-GMRES optimization algorithm applied to general nonlinear optimization 
problems. 

Note that, in the case of linear systems, the efficiency of GMRES as an acceleration 
technique for stationary iterative methods can be understood in terms of how optimal 
polynomials can damp modes that are slow to converge |TSj [Mj. In the case of 
N-GMRES for nonlinear optimization, if the approximation is close to a stationary 
point and the nonlinear residual vector function g(.) can be approximated well by 
linearization, then it can be expected that the use of the subspace span{\ii + \ — 
Uo, M-i+i — Ui, . . . , Uj+i — Ui} for acceleration may give efficiency similar to the linear 
case [18] . Note finally that the above also explains why a small step is allowed in the 
sd preconditioncr of (|2.2j) (basically, in the linear case, the size of the coefficient does 
not matter for the Krylov space), and the linearization argument of (|2.4|) indicates 
that a small step may be beneficial. 

2.4. Convergence Theory for N-GMRES Optimization with Steepest 
Descent Preconditioning. We now formulate and prove a convergence theorem 
for N-GMRES Optimization Algorithm [T] using steepest descent preconditioning with 
line search (|2.1j) . We assume that all line searches provide step lengths that satisfy 
the Wolfe conditions [TP] : 

SUFFICIENT DECREASE CONDITION: 

/(iH + ftPi) < /(u<) + ci Pi V/( Ul ) T Pj , (2.11) 

CURVATURE CONDITION: 

V/( Ui +ft Pi ) T Pi > C 2 V/( Ui ) T P i, (2.12) 

with < ci < C2 < 1. Condition (|2.11[) ensures that large steps are taken only if they 
lead to a proportionally large decrease in /. Condition (|2.12|) ensures that a step is 
taken that is large enough to sufficiently increase the gradient of / in the line search 
direction (make it less negative). Global convergence (in the sense of convergence to 
a stationary point from any initial guess) can then be proved easily using standard 
approaches [51 HP]. 

Theorem 2.1 (Global convergence of N-GMRES optimization algorithm with 
steepest descent line search preconditioning). Consider N-GMRES Optimization 
Algorithm [7] with steepest descent line search preconditioning \2. 1\) for Optimiza- 
tion Problem I, and assume that all line search solutions satisfy the Wolfe condi- 
tions, H2.ll]) and H2.12\) . Assume that objective function f is bounded below in ffi™ 
and that f is continuously differ entiable in an open set Af containing the level set 
C = {u : /(u) < /(uo)}, where Uo is the starting point of the iteration. Assume also 
that the gradient V / is Lipschitz continuous on Af, that is, there exists a constant L 
such that ||V/(u) — V/(u)|| < £||u— u|| for all u, u G Af . Then the sequence of N- 
GMRES iterates {uo, Ui, . . .} is convergent to a fixed point of Optimization Problem 
I in the sense that 

lim ||V/(Ui)|| = 0. (2.13) 

Proof. Consider the sequence {vo,Vi, . . .} formed by the iterates uo, Ui, Ux, Q2, 
U2, ... of Algorithm I, but with removed if — is not a descent direction in 
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Step III of the algorithm. Then all iterates v, arc of the form = Vj_i + /3j_iPj_i, 
with Pi-i a descent direction and such that the Wolfe conditions arc satisfied. 
According to Theorem 3.2 of [TU] (p. 38, Zoutcndijk's Theorem), we have that 

oo 

5^cos 2 0i||V/(vi)|| 2 <oo, (2.14) 



i=0 



with 



which implies that 



l|V/(vi)|| ||pi|| 



lim cos 2 0i ||V/(vi)|| 2 = 0. (2.16) 

i— >-oo 



Consider the subsequence {||V/(Uj)||} of {||V/(vi)||}. Since all the u, are followed 
by a steepest descent step in the algorithm, the 6% corresponding to all the el- 
ements of {||V/(us)||} satisfy cos#i = 1. Therefore, it follows from (|2.16j) that 
limj^oo ||V/(uj)|| = 0, which concludes the proof. □ 

Note that the notion of convergence (|2.13[) we prove in Theorem l2.1l for N-GMRES 
optimization with steepest descent line search preconditioning is stronger than the 
type of convergence that can be proved for some N-CG methods J6j[10], namely, 



lim inf ||V/( Ul )|| =0. (2.17) 

i— >oo 

Also, it appears that, in the proof of Theorem l2.11 we cannot guarantee that sequence 
{|jV/(u.;)|j} converges to 0. We know that sequence {/(vj)} converges to a value 
/* since it is nonincreasing and bounded below, but it appears that the properties 
of the line searches do not guarantee that the sequence {||V/(vi)||} converges to 0. 
They do guarantee that the subsequence {||V/(u,)||} converges to 0, but it cannot be 
ruled out that, as the /(uj) approach /* and the ||V/(uj)|| approach 0, large steps 
with very small decrease in / may still be made from each to the next Uj+i (large 
steps with small decrease are allowed in this case since the approach a stationary 
point), while, at the same time, large steps with very small decrease in / may be 
made from the Uj+i to the next Uj+i (large steps with small decrease are allowed in 
this case if the search direction p from u,+i is such that V/(tii + i) T p is very close to 
0). These large steps may in principle preclude {|| V/(iij)||} from converging to (but 
we do not observe such pathological cases in our numerical tests). Nevertheless, we 
are able to prove the strong convergence result (|2.13[) for the iterates of N-GMRES 
optimization with steepest descent line search preconditioning: sequence {||V/(uj)||} 
converges to 0. 

3. Numerical Results. We now present extensive numerical results for the N- 
GMRES optimization algorithm with steepest descent preconditioners (|2.ip and (12.21) , 
compared with stand-alone steepest descent optimization, N-CG and L-BFGS. 

In all tests, we utilize the More-Thuente line search method [8] and the N-CG and 
L-BFGS optimization methods as implemented in the Poblano toolbox for Matlab [4]. 
For all experiments, the More-Thuente line search parameters used were as follows: 
function value tolerance c\ = 10~ 4 for (|2.1ip . gradient norm tolerance C2 = 10 -2 for 
(I2.12|) . starting search step length j3 — 1, and a maximum of 20 f/g evaluations are 
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used. These values were also used for the N-CG and L-BFGS comparison runs. We 
use the N-CG variant with Polak-Ribicre update formula, and the two-loop recursion 
version of L-BFGS [TO]. We normally choose the N-GMRES window size w equal to 
20, which is confirmed to be a good choice in numerical tests described below. The 
L-BFGS window size is chosen equal to 5 (we found that larger window sizes tend 
to harm L-BFGS performance for the tests we considered). All initial guesses are 
determined uniformly randomly with components in the interval [0, 1], and when we 
compare different methods they are given the same random initial guess. All numerical 
tests were run on a laptop with a dual-core 2.53 GHz Intel Core i5 processor and 4GB 
of 1067 MHz DDR3 memory. Matlab version 7.11.0.584 (R2010b) 64-bit (maci64) 
was used for all tests. 

3.1. Test Problem Description. We first describe the seven test problems we 
consider. In what follows, all vectors are chosen in R™, and all matrices in R nx ™. 
Problem A. (Quadratic objective function with spd diagonal matrix.) 

/(u) = i(u-u*) T £(u-u*) + l, (3.1) 

with D = diag(l, 2, . . . , n). 

This problem has a unique minimizer u* in which /* = /(u*) = 1. We choose 
u* = (1, . . . , 1). Note that g(u) = D(u — u*), and the condition number of D is given 
by k — n. It is well-known that for problems of this type large condition numbers tend 
to lead to slow convergence of the steepest descent method due to a zig-zag effect. 
Problem A can be used to show how methods like N-CG and N-GMRES improve over 
steepest descent and mitigate this zig-zag effect. 

Problem B. (Problem A with paraboloid coordinate transformation.) 

/(u) = iy(u-u*) T ^y(u-u*) + l, (3.2) 

with D = diag(l, 2, . . . , n) and y(x) given by 

2/1 (x) = x\ and j/j(x) = Xi - 10 xf (i = 2, . . . ,n). 

This modification of Problem A still has a unique minimizer u* in which /* = 
/(u*) = 1. We choose u* = (1,...,1). The gradient of /(u) is given by g(u) = 
Dy(u - u*) - 20 (ui - Ui)(E"= 2 (- D y( u - u *))j) [1,0,..., 0] T . This modification 
of Problem A increases nonlincarity (the objective function is now quartic in u) and 
changes the level surfaces from ellipsoids into parabolically skewed ellipsoids. As such, 
the problem is more difficult for nonlinear optimization methods. For n = 2, the level 
curves are modified from elliptic to 'banana-shaped'. In fact, the objective function 
of Problem B is a multi-dimensional generalization of Rosenbrock's 'banana' function. 

Problem C. (Problem B with a random non-diagonal matrix with condi- 
tion number n = n.) 

/(u) = iy(u-u*) T Ty(u-u*) + l, (3.3) 

with T = Qdiag(l, 2, . . . , n) Q T , where Q is a 
random orthogonal matrix and y(x) is given by 
yi(x) = xi and y.^x) = x t - \0x\ (i = 2, . . . ,n). 
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This modification of Problem B still has a unique minimizcr u* in which /* = 
/(u*) = 1. Wc choose u* = (1,...,1). The gradient of /(u) is given by g(u) = 
Ty(u - u*) - 20 (ui - u{) (£™ =2 (:Ty(u ~ u *))j) I 1 - 0, ... , 0] T . The random matrix Q 
is the Q factor obtained from a QR-factorization of a random matrix with elements 
uniformly drawn from the interval [0,1]. This modification of Problem B introduces 
nonlinear 'mixing' of the coordinates (cross-terms) and further increases the difficulty 
of the problem. 
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Fig. 3.1. Problem A (n = 100J. Convergence histories of the 10-logarithms of |/(uj) — f*\ 
and ||g(iij)|| as a function of iterations and f/g evaluations. N-GMRES-sdls is the N-GMRES 
optimization algorithm using steepest descent preconditioning with line search, N-GMRES-sd is the 
N-GMRES optimization algorithm using steepest descent preconditioning with predefined step, N-CG 
is the Polak-Ribiere nonlinear conjugate gradient method, L-BFGS is the limited-memory Broyden- 
Fletcher-Goldfarb-Shanno method, and sdls is the stand-alone steepest descent method with line 
search. 

Problem D. (Extended Rosenbrock function, problem (21) from 

1 " 

/(u) = - ^ ij( u ), w hh n even and 

i=i 

tj = 10 (u j+1 - u)) (jodd), 
t j = 1 — Uj-i (j even). 

Note that g(u) can easily be computed using </fc(u) = tj dtj/duk (k = 1, . . . , n). 
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(a) convergence to f, N-GMRES with sd preconditioner 



(b) gradient norm convergence, N-GMRES with sd preconditioner 
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Fig. 3.2. Problem A (n = 100). Effect of varying window size w on |/(u;) — f*\ and ||g(uj)|| 
convergence for N-GMRES-sdls and N-GMRES-sd optimization as a function of f/g evaluations. 
Window size w = 20 emerges as a suitable choice, leading to rapid convergence. These results 
give some general indication that, if sufficient memory is available, w = 20 may be a good choice. 
However, if memory is scarce, w = 3 already provides good results, especially for N-GMRES-sd. 



Problem E. (Brown almost-linear function, problem (27) from pi]. 

1 " 

/(u) = - £ if (u), with 



t 3 = u j + (Yl u i) - (n + 1) (j < n), 

i=l 

n 

t n = - 1. 

i=l 

Problem F. (Trigonometric function, problem (26) from [§].) 
/(u) = lf^ 2 (u), with 

" 3=1 

n 
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Problem G. (Penalty function I, problem (23) from [§]■) 

n 

/(")= 2«E ^ 2 (u))+4 + i(u)), with 
t j = y/W^(u j -l) (j = l,...,n), 

n 

t n+1 = (J2 O - 0.25. 

i=l 
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Fig. 3.3. Problem B (n = 100J. Convergence comparison. 

3.2. Numerical Results for Problems A C. We first present some conver- 
gence plots for instances of Problems A-C. Fig. I3.ll shows results for an instance of 
Problem A. We see that stand-alone steepest descent with line search (sdls) converges 
slowly, which is expected because the condition number of matrix D is n = 100. Both 
N-GMRES optimization using steepest descent preconditioning with line search (|2.ip 
(N-GMRES-sdls) and N-GMRES optimization using steepest descent preconditioning 
with predefined step (|2.2p (N-GMRES-sd) are significantly faster than stand-alone 
sdls, in terms of iterations and f /g evaluations, confirming that the N-GMRES accel- 
eration mechanism is effective, and steepest descent is an effective preconditioner for 
it. As could be expected, the preconditioning line searches of N-GMRES-sdls add sig- 
nificantly to its / 1 g evaluation cost, and N-GMRES-sd is more effective. N-GMRES 
accelerates steepest descent up to a point where performance becomes competitive 
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(a) convergence to f* 



(b) convergence of the gradient norm 
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Fig. 3.4. Problem C (n = 100). Convergence comparison. 



with N-CG and L-BFGS. It is important to note that convergence profiles like the 
ones presented in Fig. 13. II tend to show significant variation depending on the random 
initial guess. The instances presented are arbitrary and not hand-picked with a spe- 
cial purpose in mind (they simply correspond to seed in our matlab code) and we 
show them because they do provide interesting illustrations and show patterns that 
we have verified to be quite general over many random instances. However, they can- 
not reliably be used to conclude on detailed relative performance of various methods. 
For this purpose, we provide tables below that compare performance averaged over a 
set of random trials. 

Fig. l3.2l shows the effect of varying the window size w on |/(u.j) — f* \ and ||g(uj)|| 
convergence for N-GMRES-sdls and N-GMRES-sd optimization as a function of f/g 
evaluations, for an instance of Problem A. Window size w = 20 emerges as a suitable 
choice if sufficient memory is available, leading to rapid convergence. However, win- 
dow sizes as small as w = 3 already provide good results, especially for N-GMRES-sd. 
This indicates that satisfactory results can be obtained with small windows, which 
may be useful if memory is scarce. We use window size w = 20 for all numerical 
results in this paper. 

Fig. 13.31 shows results for an instance of Problem B, which is a modification of 
Problem A introducing more nonlinearity, and Fig. l3.4l shows results for the even more 
difficult Problem C, with random nonlinear mixing of the coordinate directions. Both 
figures show that stand-alone sdls is very slow, and confirm that N-GMRES-sdls and 
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N-GMRES-sd significantly speed up steepest descent. For Problem B, N-GMRES- 
sdls, N-GMRES-sd, N-CG and L-BFGS perform similarly, but for the more difficult 
Problem C N-GMRES-sdls, N-GMRES-sd and L-BFGS perform much better than 
N-CG. 



problem 


N-GMRES-sdls 


N-GMRES-sd 


N-CG 


L-BFGS 


A n=100 


242 


111 


84 


73 


A n=200 


406 


171 


127 


104 


B n=100 


1200 


395 


198 


170 


B n=200 


1338 


752 


606 


321 


C 77=100 


926(1) 


443 


13156(7) 


151 


C n=200 


1447 


461 


26861(9) 


204 



Table 3.1 



Average number of f/g evaluations needed to reach |/(Uj) — f*\ < 10 -6 for 10 instances of 
Problems A-C with random initial guess and with different sizes. Numbers in brackets give the 
number of random trials (out of 10) that did not converge to the required tolerance within 1500 
iterations (if any). 




Fig. 3.5. Problem D (n = 1000,). Convergence comparison. 



Table 13.11 confirms the trends that were already present in the specific instances 
of test problems A-C that were shown in Figures 13.11 13.31 and 13.41 The table gives 
the average number of f/g evaluations that were needed to reach |/(u,) — /*| < 10 -6 
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for 10 random instances of Problems A-C with different sizes. For Problems A and 
B, N-GMRES-sdls and N-GMRES-sd consistently give f/g evaluation counts that 
are of the same order of magnitude as N-CG. N-GMRES-sd comes close to being 
competitive with N-CG. L-BFGS is the fastest method for all problems in Table [3~T1 
For the more difficult Problem C, both N-GMRES-sdls, N-GMRES-sd and L-BFGS 
are significantly faster than N-CG, which appears to have convergence difficulties for 
this problem. N-GMRES-sd is clearly faster than N-GMRES-sdls for all tests. 

3.3. Numerical Results for Problems D— G. Figure 13.51 gives convergence 
plots for a single instance of Problem D. It confirms the observations from Figures 
I3.lll3.3l and l3.4l for this standard test problem from [9], stand-alone sdls again is very 
slow, and N-GMRES-sdls and N-GMRES-sd significantly speed up steepest descent 
convergence. N-GMRES-sdls and N-GMRES-sd have iteration and f/g counts that 
are of the same order of magnitude as N-CG and L-BFGS, and in particular N- 
GMRES-sd is competitive with N-CG and L-BFGS. Convergence plots for instances 
of Problems E-G show similar behaviour and are not presented. 



problem 


N-GMRES-sdls 


N-GMRES-sd 


N-CG 


L-BFGS 


D n=500 


525 


172 


222 


166 


D n=1000 


445 


211 


223 


170 


E n=100 


294 


259 


243 


358 


E n=200 


317 


243 


240 


394 


F n=200 


140 


102(1) 


102 


92 


F n=500 


206(1) 


175(1) 


135 


118 


G n=100 


1008(2) 


152 


181 


358 


G n=200 


629(1) 


181 


137 


240 



Table 3.2 



Average number of f/g evaluations needed to reach |/(u;) — f*\ < 10~ 6 for 10 instances of 
Problems D-G with random initial guess and with different sizes. Numbers in brackets give the 
number of random trials (out of 10) that did not converge to the required tolerance within 500 
iterations (if any). 

Table 13.21 on / / g evaluation counts for Problems E-G again confirms the trends 
that were observed before. N-GMRES-sdls and N-GMRES-sd give f/g evaluation 
counts that are of the same order of magnitude as N-CG and L-BFGS, and N-GMRES- 
sd in particular is competitive with N-CG and L-BFGS. 

4. Conclusion. In this paper, we have proposed and studied steepest descent 
preconditioning as a universal preconditioning approach for the N-GMRES optimiza- 
tion algorithm that we recently introduced in the context of a canonical tensor ap- 
proximation problem and ALS preconditioning [3] (Paper I) . We have considered two 
steepest descent preconditioning process variants, one with a line search, and the other 
one with a predefined step length. The first variant is significant because we showed 
that it leads to a globally convergent optimization method, but the second variant 
proved more efficient in numerical tests, with no apparent degradation in convergence 
robustness. Numerical tests showed that the two steepest-descent preconditioned 
N-GMRES methods both speed up stand-alone steepest descent optimization very 
significantly, and are competitive with standard N-CG and L-BFGS methods, for 
a variety of test problems. These results serve to theoretically and numerically es- 
tablish steepest-descent preconditioned N-GMRES as a general optimization method 
for unconstrained nonlinear optimization, with performance that appears promising 
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(a) convergence to f* 
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Fig. 4.1. Convergence histories of the 10-logarithm of |/(u;) — /* as a function of f/g eval- 
uations, for the canonical tensor approximation problem of Figures 1.2 and 1.3 in [31 . Panel (a) 
shows that stand-alone sdls is very slow for this problem, and N-GMRES-sdls and N-GMRES-sd 
significantly speed up steepest descent. However, for this difficult problem, it is beneficial to use 
a more powerful nonlinear preconditioner. Using the ALS preconditioner in stand-alone fashion 
already provides faster convergence than N-GMRES-sdls and N-GMRES-sd. The zoomed view in 
Panel (b) shows that N-CG and L-BFGS are faster than stand-alone ALS when high accuracy is 
required, but N-GMRES preconditioned with the powerful ALS preconditioner is the fastest method 
by Jar, beating N-CG and L-BFGS by a factor of 2 to 3. This illustrates that the real power of 
the N-GMRES optimization algorithm may lie in its ability to employ powerful problem- dependent 
nonlinear preconditioners (ALS in this case). 
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compared to established techniques. 

However, we would like to argue that the real potential of the N-GMRES op- 
timization framework lies in the fact that it can use problem-dependent nonlinear 
preconditioners that are more powerful than steepest descent. Preconditioning of N- 
CG in the form of (linear) variable transformations is an area of active research [7]. 
However, it is interesting to note that our N-GMRES optimization framework natu- 
rally allows for a more general type of preconditioning: any nonlinear optimization 
process M(.) can potentially be used as a nonlinear preconditioner in the frame- 
work, or, equivalently, N-GMRES can be used as a simple wrapper around any other 
iterative optimization process M(.) to seek acceleration of that process. This can 
be illustrated with the following example, in which we first apply N-GMRES with 
the steepest descent preconditioners proposed in this paper, to a canonical tensor 
approximation problem from [3J. (In particular, we consider the canonical tensor ap- 
proximation problem of Figures 1.2 and 1.3 in [3], in which a rank-three canonical 
tensor approximation (with 450 variables) is sought for a three-way data tensor of 
size 50 x 50 x 50.) Panel (a) of Fig. |4~T1 shows how stand-alone steepest descent (sdls) 
is very slow for this problem: it requires more than 30,000 f/g evaluations. (The 
tensor calculations are performed in matlab using the Tensor Toolbox [2]. For this 
problem, we use 5 = 10~ 3 in 1(23]).) The GMRES-sdls and N-GMRES-sd convergence 
profiles confirm once more one of the main messages of this paper: steepest-descent 
preconditioned N-GMRES speeds up stand-alone steepest descent very significantly. 
However, steepest descent preconditioning (which we have argued is in some sense 
equivalent to non-preconditioned GMRES for linear systems) is not powerful enough 
for this difficult problem, and a more advanced preconditioner is required. Indeed, 
Panel (a) of Fig. 14.11 shows that the stand-alone ALS process is already more efficient 
than steepest-descent preconditioned N-GMRES. Panel (b) indicates, however, that 
N-GMRES preconditioned by ALS is a very effective method for this problem: it 
speeds up ALS very signficantly, and is much faster than N-CG and L-BFGS, by a 
factor of 2 to 3. (Panel (b) of Fig. 14. 1 1 illustrates the findings from extensive tests com- 
paring ALS, N-CG and ALS-preconditioned N-GMRES that were reported in Paper 
I and [T].) 

In the case of GMRES for linear systems, non-preconditioned GMRES (or: GM- 
RES with the identity preconditioner) is often just a starting point. For many difficult 
problems it converges too slowly, and there is a very extensive and ever expanding 
research literature on developing advanced problem-dependent preconditioners that 
in many cases speed up convergence very significantly. In the same way, the present 
paper is likely not more than a starting point in theoretically and numerically estab- 
lishing the N-GMRES optimization method with general steepest descent precondi- 
tioning process. As the results shown in Fig. 14.11 already indicate, we expect that 
the real power of the N-GMRES optimization framework will turn out to lie in its 
ability to use powerful problem-dependent nonlinear preconditioners. This suggests 
that further exploring N-GMRES optimization with advanced preconditioners may 
lead to efficient numerical methods for a variety of nonlinear optimization problems. 
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